Identifying lung-specific genes for each cell type¶

Procedures¶

  1. Perform QCs separately for lung and spleen datasets
  2. Integrate two datasets for joint downstream analyses
    • Identify anchors that link cells from either dataset
  3. Perform dimensional reduction and clustering
  4. Annotate clusters with CellTypist
  5. DE analysis across cell types
  6. DE analysis across tissues
  7. Examine specific genes
  8. GO analysis

Sample Information¶

image.png

1 Performing QCs separately for lungs and spleens¶

Aggregated lung samples¶

Loading required package: Signac

Aggregated spleen samples¶

2 Integraion of lung and spleen datasets¶

The joint analysis of multiple single-cell datasets is achieved by identifying cross-dataset pairs of cells that have shared cell populations, named as anchors. This helps address two problems:

  1. correct for technical differences between datasets
  2. perform comparative analaysis across experimental conditions
In [ ]:
#Data normalization and variable feature selection
lung.list <- SplitObject(lung, split.by = "orig.ident")
lung.list <- lapply(X = lung.list, FUN = function(x) {
    x <- NormalizeData(x, verbose = FALSE)
    x <- FindVariableFeatures(x, verbose = FALSE)
})

#select integration features
features <- SelectIntegrationFeatures(object.list = lung.list)
lung.list <- lapply(X = lung.list, FUN = function(x) {
    x <- ScaleData(x, features = features, verbose = FALSE)
    x <- RunPCA(x, features = features, verbose = FALSE)
})

#iteratively find anchors that link cells across two datasets
anchors <- FindIntegrationAnchors(object.list = lung.list, reference = c(1, 2), reduction = "rpca",
    dims = 1:20)#1:50
lung.integrated <- IntegrateData(anchorset = anchors, dims = 1:20)

3 Dimensional reduction and clustering on all cells¶

With the the integration approach, we expect technical differences such as batch effects should be corrected. To confirm that, we check the clustering results for the same tissue across batches. When split by tissue, the top plot shows the same tissue samples from different batches look well mixed. When split by tissue and batch, the bottom clustering results show no distinct clusters coming from one batch or one donor.

Graph-based clustering

In [12]:
p2

4 Annotate cell types with CellTypist¶

Cell type compositions¶

Major cell types

T cell subsets

In [11]:
immune.combined
Loading required package: Signac

An object of class Seurat 
321136 features across 53647 samples within 4 assays 
Active assay: RNA (36601 features, 0 variable features)
 3 other assays present: ATAC, integrated, peaks
 3 dimensional reductions calculated: pca, umap, lsi

5 DE analysis across cell types¶

Picture1.png

6 DE analysis across tissues¶

Differential expression test was performed on normalized values stored in immune.combine[["RNA"]]@data. Here are the normalization steps:

  1. Normalize by the total expression for each cell
  2. multiply by a scale factor 10,000
  3. log-transform the result

Volcano plots and QQ plots comparing against null

In [10]:
p1 + p2
In [ ]:
eg = bitr(rownames(sig.genes), fromType = "SYMBOL", toType="ENTREZID", OrgDb = "org.Hs.eg.db")
    all.eg = bitr(rownames(DEG.test), fromType = "SYMBOL", toType="ENTREZID", OrgDb = "org.Hs.eg.db")

    go_list <- enrichGO(gene          = eg$ENTREZID,
                    universe      = names(all.eg$ENTREZID),
                    OrgDb         = org.Hs.eg.db,
                    ont           = "BP",
                    pAdjustMethod = "BH",
                    pvalueCutoff  = 0.01,
                    qvalueCutoff  = 0.05,
                    readable      = TRUE)
In [8]:
library(forcats)
library(enrichplot)
go_list<-mutate(go_list, richFactor = Count / as.numeric(sub("/\\d+", "", BgRatio)))

ggplot(go_list, showCategory = 20, 
                        aes(richFactor, fct_reorder(Description, richFactor))) + 
                            geom_segment(aes(xend=0, yend = Description)) +
                            geom_point(aes(color=qvalue, size = Count)) +
                            scale_color_viridis_c(guide=guide_colorbar(reverse=TRUE)) +
                            scale_size_continuous(range=c(2, 10)) +
                            theme_minimal() + theme(axis.text.y=element_text(size=12)) + 
                            xlab("rich factor") +
                            ylab(NULL)

Visualizing results via volcano plots¶

Color scheme:
Red: negative log2FC (Upregulated in lungs)
blue: positive log2FC (Upregulated in spleens)
grey: data points not passing bonferroni corrected cutoff 0.05

DE results for major cell types with MAST

In [19]:
nonT<-marrangeGrob(volcano_plots[c(1,2,3,7,9,12)], nrow=2, ncol=3)
nonT
[[1]]
NULL

DE results for major cell types with Wilcox ranksum test

In [49]:
#wilcoxon test
nonT<-marrangeGrob(volcano_plots[c(1,2,3,7,9,12)], nrow=2, ncol=3)
nonT
[[1]]
NULL

DE results for T cell susbets with MAST

[[1]]
NULL

DE results for T cell susbets with Wilcoxon rank sum test

In [42]:
T<-marrangeGrob(volcano_plots[c(4,5,6,10)], nrow=2, ncol=3)
T
[[1]]
NULL

Summerizing number of DEGs by direction¶

In [53]:
ggplot(counts.tbl, 
       aes(x = c, y=n, 
           fill=labels, label=n,
           width=0.5)) + 
geom_col(position="dodge2") + geom_label(size=5,position = position_dodge2(width=0.8)) + 
theme_light(base_size = 15) + labs(y="DE genes") + ggtitle("DE with Wilcoxon rank sum test") + 
scale_fill_discrete(labels=c("Up-regulated in spleens",
                             "Up-regulated in lungs")) +
theme(axis.text.x=element_text(size=16, vjust=0.6, angle = 45))

Top DE genes labeled on Volcano plots¶

Warning message:
“ggrepel: 167 unlabeled data points (too many overlaps). Consider increasing max.overlaps”
Warning message:
“ggrepel: 94 unlabeled data points (too many overlaps). Consider increasing max.overlaps”
Warning message:
“ggrepel: 88 unlabeled data points (too many overlaps). Consider increasing max.overlaps”
Warning message:
“ggrepel: 237 unlabeled data points (too many overlaps). Consider increasing max.overlaps”
In [12]:
## wilcoxon ranksum test
marrangeGrob(vplots, nrow=2, ncol=2)
Warning message:
“ggrepel: 224 unlabeled data points (too many overlaps). Consider increasing max.overlaps”
Warning message:
“ggrepel: 277 unlabeled data points (too many overlaps). Consider increasing max.overlaps”
Warning message:
“ggrepel: 86 unlabeled data points (too many overlaps). Consider increasing max.overlaps”
Warning message:
“ggrepel: 49 unlabeled data points (too many overlaps). Consider increasing max.overlaps”
[[1]]
NULL
Warning message:
“ggrepel: 221 unlabeled data points (too many overlaps). Consider increasing max.overlaps”
Warning message:
“ggrepel: 295 unlabeled data points (too many overlaps). Consider increasing max.overlaps”
Warning message:
“ggrepel: 63 unlabeled data points (too many overlaps). Consider increasing max.overlaps”
Warning message:
“ggrepel: 39 unlabeled data points (too many overlaps). Consider increasing max.overlaps”
[[1]]
NULL

7 Examine specific genes¶

Heatshock genes¶

The differentially expressed heatshock genes do not seem to be driven by specific batch or donor. We see consistently higher expression of heatshock genes in lungs(red) than spleens(blue) across cell types.

Marker genes for tissue-resident memory T cells¶

  • CD69, RGS1, LMNA, TGCC DUSP6, SOCS1
In [210]:
p1+p2

8 Gene set enrichment analysis via ClusterProfiler¶

Identify the known pathways that are enriched with the differentially expressed genes in lungs

TO DO: Use WebGesalt to remove redundant GO terms

Summerize top 20 GO terms in molecular functions¶

  • rich factor as ratio between #DE genes in GO terms and #non-DE genes in the background
  • The brighter the color of dots shows, the test becomes more significant
[[1]]
NULL
[[1]]
NULL
[[1]]
NULL